Exploring influential nodes using global and local information

In complex networks, key nodes are important factors that directly affect network structure and functions. Therefore, accurate mining and identification of key nodes are crucial to achieving better control and a higher utilization rate of complex networks. To address this problem, this paper proposes an accurate and efficient algorithm for critical node mining. The influential nodes are determined using both global and local information (GLI) to solve the shortcoming of the existing key node identification methods that consider either local or global information. The proposed method considers two main factors, global and local influences. The global influence is determined using the K-shell hierarchical information of a node, and local influence is obtained considering the number of edges connected by the node and the given values of adjacent nodes. The given values of adjacent nodes are determined based on the degree and K-shell hierarchical information. Further, the similarity coefficient of neighbors is considered, which enhances the differentiation degree of the adjacent given values. The proposed method solves the problems of the high complexity of global information-based algorithms and the low accuracy of local information-based algorithms. The proposed method is verified by simulation experiments using the SIR and SI models as a reference, and twelve typical real-world networks are used for the comparison. The proposed GLI algorithm is compared with several common algorithms at different periods. The comparison results show that the GLI algorithm can effectively explore influential nodes in complex networks.

Basic ideas of proposed GLI method. The factors should be considered comprehensively from both global and local aspects. For instance, an important project in the real world can usually be decomposed into multiple subprojects, so a team responsible for completing the project needs to be divided into groups, each of which will be responsible for one subproject. Such a team can be regarded as a complex network where team members denote nodes, and the influence of each member depends on his position in the network. The higher the position is, the greater the number of resources will be; for instance, the project team leader is more influential than the group leader, and the group leader is more influential than the ordinary members. At the same time, the self-capability and the help provided by the team members are also important factors that define the influence. If members A and B have many common neighbors, member B is considered to be close to member A, members B and A have a great similarity, and member B has a greater influence on member A. The more similar the other members are with a particular member, the greater their influence on the member is. Accordingly, the proposed GLI method considers the contribution of both global and local information. First, global information is represented through network hierarchy obtained by the K-shell method; next, local information is represented by the degree of self and adjacent nodes, and adjacent nodes' Ks values. In addition, the similarity coefficient is introduced, and the higher the similarity between the nodes is, the greater the contribution provided by the adjacent nodes will be.
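The ingredients named above (K-shell value, degree, and neighbor similarity) can be sketched in a few lines of standard-library Python. This is only an illustration on a hypothetical five-node graph, not the authors' implementation: the helper names `build_graph`, `k_shell`, and `jaccard` are invented here, and the similarity coefficient is assumed to be the Jaccard index of the two neighbor sets, which the text does not state explicitly.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def k_shell(adj):
    """Return {node: Ks} by repeatedly peeling nodes of lowest remaining degree."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    ks, k = {}, 0
    while remaining:
        # The current shell index is the smallest remaining degree.
        k = max(k, min(degree[v] for v in remaining))
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue  # already peeled via another stack entry
            remaining.discard(v)
            ks[v] = k
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        stack.append(w)
    return ks


def jaccard(adj, u, v):
    """Assumed similarity coefficient: shared neighbours / all neighbours."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Hypothetical example: a triangle a-b-c with a tail c-d-e.
adj = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])
degree = {v: len(nbrs) for v, nbrs in adj.items()}
print(degree, k_shell(adj), jaccard(adj, "a", "b"))
```

In this toy graph the triangle nodes end up in the 2-shell and the tail nodes in the 1-shell, while node c has the largest degree, which shows why the hierarchy and the degree carry different information.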

Contribution of proposed GLI.
This study provides an innovative research perspective for the identification of key nodes in a complex network. The innovation of the proposed GLI algorithm is mainly reflected in three aspects:
1. An accurate key node identification method, which considers the influence of nodes from both global and local aspects, is proposed. The shortcomings of the existing coarse-grained methods based on global information are addressed, and the accuracy of key node identification is improved.
2. The proposed GLI method uses the similarity coefficient between nodes in a network, which enhances the differentiation ability of the adjacent given values and improves the identification ability of key nodes.
3. The proposed GLI method considers both global and local information, which makes it highly practical and suitable for large and complex networks.
The rest of this article is organized as follows. Section "Related work" describes related work. Section "Proposed GLI" describes the proposed GLI method and presents its design idea and specific working process. Section "Experimental Results" compares the proposed GLI method and classical algorithm on different datasets and analyzes the comparison results. Finally, Section "Conclusion" summarizes the main contributions of the study.

Related work
Many factors influence the accuracy of key node identification in complex networks. In the following, a very brief survey of the methods relevant to the proposed GLI is provided. The measures considered in these methods are introduced from the global and local perspectives.
(1) Local centrality
DC (degree centrality): A local centrality measure based on the number of edges connected to a node.
EC (eigenvector centrality): A local centrality measure based on a node's degree and the influence of its neighbors.
PageRank 32 (PR): Similar to the EC, this measure reflects the importance of a node, which is determined by both the quantity and the quality of adjacent nodes. It is a local centrality measure that uses the node degree and the PR values of the neighboring nodes.
ProfitLeader 33 (PL): The ProfitLeader algorithm computes the profit a node provides to the other nodes, and the importance of the node is related to this profit. It is a local centrality measure that uses the node profit and the probability of sharing it with neighbors.
(2) Global centrality
BC (betweenness centrality): A global centrality measure based on the number of shortest paths passing through a node.
CC (closeness centrality): A global centrality measure based on the shortest-path distances between a node and the other nodes.
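To make the local/global distinction concrete, the following sketch computes one local measure (DC) and one global measure (CC) on a hypothetical three-node path. The function names and the normalization by n - 1 are assumptions following a common convention; the surveyed papers may normalize differently.

```python
from collections import deque


def degree_centrality(adj):
    """Local measure: fraction of the other nodes a node is linked to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """Global measure: reachable nodes divided by the sum of BFS distances."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical path graph 1 - 2 - 3.
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(degree_centrality(path))
print(closeness_centrality(path))
```

DC only needs each node's own neighbor list, whereas CC runs one breadth-first search per node, which is the kind of extra cost that global measures incur on large networks.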
Sum(v i ) represents the sum of the given influence of all the adjacent nodes. The calculation formula of Sum(v i ) is as follows:

Further, assume that in the considered network, a node v i has at most maxD adjacent nodes that provide the given influence. Then, Sum(v i ) is divided by maxD for normalization, and the given influence of node v i on its adjacent nodes is obtained by:

where Local(v i ) indicates the local influence, and it is defined by:

Definition 6 Node influence: The node influence I(v i ) depends on Global(v i ) and Local(v i ), and it is calculated as follows:

Proposed model evaluation. To verify the effect of the proposed algorithm, two evaluation models, the SIR and SI models, were selected. The reliability of the comparison analysis was validated by experimental results.
SIR model. The SIR model 40 is a mathematical model describing disease transmission, which is a general standard for evaluating the accuracy of node identification.
In the SIR model, network nodes are divided into three categories as follows:
Susceptible: A node that is not sick but lacks immunity and is vulnerable to infection after contact with a sick node.
Infection: A sick (infected) node that can infect a susceptible node.
Removed: A node removed from the network, either recovered (with immunity) or dead; these nodes are no longer involved in the infection process.
Next, assume that at time t, the total node number All(t) is unchanged, and each node is in one of three states: susceptible, Susceptible(t); infected, Infection(t); or removed, Removed(t). It then holds that All(t) = Susceptible(t) + Infection(t) + Removed(t). At time t, the number of newly infected nodes is α * Susceptible(t) * Infection(t).
SI model. Similar to the SIR model, the SI model 41 is the simplest disease transmission model. In the SI model, network nodes are divided into two groups: susceptible nodes and infected nodes. At time t, a susceptible node may become an infected node with a probability of α, and this process is irreversible.
In the experiment conducted in this study, each network node was treated in turn as the initially sick node, and the number of nodes it infected was calculated using the SIR and SI models. The calculation was repeated over multiple iterations to estimate the infection influence of each node. The node rankings obtained by these two models were used as the evaluation criterion against which the results of the proposed GLI algorithm and several related algorithms were compared.
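The single-seed evaluation described here can be sketched as a small Monte Carlo simulation. The code below is a minimal illustration under stated assumptions, not the experimental code of the paper: the names `sir_spread` and `sir_influence`, the synchronous update order, and the use of the final number of removed nodes as the influence score are all choices made for this example.

```python
import random


def sir_spread(adj, seed, alpha, beta=1.0, rng=None):
    """Run one SIR outbreak from a single seed; return the final removed count."""
    rng = rng or random.Random()
    infected, removed = {seed}, set()
    while infected:
        new_infected = set()
        for u in infected:
            for w in adj[u]:
                susceptible = w not in infected and w not in removed
                if susceptible and rng.random() < alpha:
                    new_infected.add(w)
        # Each infected node recovers (is removed) with probability beta.
        recovered = {u for u in infected if rng.random() < beta}
        removed |= recovered
        infected = (infected - recovered) | new_infected
    return len(removed)


def sir_influence(adj, seed, alpha, runs=1000, rng_seed=0):
    """Average outbreak size over several independent runs."""
    rng = random.Random(rng_seed)
    return sum(sir_spread(adj, seed, alpha, rng=rng) for _ in range(runs)) / runs


# Hypothetical path graph 0 - 1 - 2 - 3.
line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
scores = {v: sir_influence(line, v, alpha=0.3, runs=200) for v in line}
print(sorted(scores, key=scores.get, reverse=True))
```

Sorting the nodes by this averaged score gives the reference ranking against which a centrality ranking can be compared.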
Kendall coefficient. The Kendall coefficient τ is used in this study to determine the similarity between the ranking results of the proposed algorithm and those of the SIR model 42 on the same network. Assume nodes v i and v j are scored by the GLI algorithm to obtain the values GLI(v i ) and GLI(v j ). Then, nodes v i and v j are processed by the SIR model, and the values SIR(v i ) and SIR(v j ) are obtained. If GLI(v i ) > GLI(v j ) and SIR(v i ) > SIR(v j ), or GLI(v i ) < GLI(v j ) and SIR(v i ) < SIR(v j ), then the pair is considered consistent, and it is assigned the value 1. If GLI(v i ) > GLI(v j ) and SIR(v i ) < SIR(v j ), or GLI(v i ) < GLI(v j ) and SIR(v i ) > SIR(v j ), then the pair is considered inconsistent, and it is assigned the value -1. The specific calculation formula is as follows:

where X and Y represent the two evaluated sequences, n c is the number of consistent pairs in the two sequences, and n d is the number of inconsistent pairs in the two sequences.
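The pair counting described above can be sketched as follows. The normalization used here, τ = (n_c - n_d) / (0.5 n (n - 1)) with n the number of ranked nodes, is the common tau-a form and is an assumption of this example; tied pairs are counted as neither consistent nor inconsistent. The function name `kendall_tau` and the sample scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long score sequences."""
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1  # both sequences order the pair the same way
        elif sign < 0:
            n_d += 1  # the sequences disagree on the pair
    return (n_c - n_d) / (0.5 * n * (n - 1))


# Hypothetical scores for three nodes.
centrality_scores = [0.9, 0.5, 0.1]
spreading_scores = [30.0, 8.0, 12.0]
print(kendall_tau(centrality_scores, spreading_scores))
```

A value of 1 means the two rankings agree on every pair, and a value of -1 means they disagree on every pair.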

Example description
In Fig. 1, the proposed algorithm is illustrated by calculating the influence of node v 1 :
(1) Node degree. According to the idea of the GLI algorithm, the degrees of node v 1 and its adjacent nodes are calculated by Eq. (1), and the results are shown in Table 1. The maximum degree, calculated by Eq. (2), is maxD = 6.
(2) Ks value of nodes. According to Eq. (3), after the decomposition by the K-shell algorithm, the Ks values of node v 1 and its adjacent nodes are obtained, and the results are shown in Table 2.
(3) Global influence of node. According to Eq. (4), it is obtained that:
(4) Similarity coefficient of nodes. According to Eq. (5), the similarity coefficient of node v 1 and its neighbors, denoted by J(v 1 ,v j ), is calculated, and the obtained results are shown in Table 3.
(5) Local influence. The given influence of the adjacent nodes of node v 1 is calculated by Eq. (7), and the results are shown in Table 4.
(6) Node influence. According to Eq. (11), it can be obtained that:
Table 2. Ks of v 1 and adjacent nodes.
Table 3. The similarity coefficient values of node v 1 and its adjacent nodes.
Following the above-presented steps, the influence values of all nodes in Fig. 1 are calculated, as shown in Table 5.
Data description. Twelve real representative networks were selected to evaluate the proposed GLI algorithm, and they are as follows: 1. Blogs network 43  The relevant property statistics of the experimental datasets are given in Table 6.

Experimental result.
To evaluate the applicability of the GLI algorithm, nine typical algorithms were implemented in Python, and the experiments were performed on ten datasets of different sizes. The experimental hardware platform was a Lenovo desktop computer with an i5-10100 CPU and 32 GB of memory; the software environment was Spyder (Python 3.7.3).
Experimental results comparison with the SIR model.
(1) Kendall value analysis. The Kendall τ values are shown in Fig. 2. In the Protein network, the Kendall τ value of the GLI algorithm was the highest at all infection probabilities. Because the KBKNR algorithm uses the number of neighbor nodes as the divisor when calculating the influence coefficient, it cannot run correctly on the Protein network, which contains nodes with a degree of zero. In six further networks, namely the Blogs, Ca-Astroph, Friendships, EmailEU32430, Reactome and USAir2010 networks, the GLI algorithm performed better than the other algorithms. In these seven networks, the maximum degree is relatively large, the average degree is relatively small, and the node degree values are well differentiated. Therefore, the GLI algorithm performs better in these networks.
In the Brightkite network, the GLI, GIN, and KBKNR algorithms were superior to the other algorithms. In the Polbooks and Karate networks, the Kendall τ value of the GLS algorithm was higher than that of the GLI algorithm, but the GLI algorithm performed better than the remaining algorithms. In the Brightkite, Polbooks and Karate networks, the node degree values are poorly differentiated. Therefore, the advantages of the GLI algorithm are not obvious. There are strong relationships between the hierarchical measure, the centrality measure, and the topological properties of the network 54 . In the Jazz and Football networks, the connections between local nodes are relatively dense, and the networks have an obvious community structure. The K-shell values and degree values are therefore poorly differentiated, and the GLI algorithm does not work well in these two networks.
(2) Optimal algorithm under different infection probabilities. As illustrated in Fig. 3, across the different infection probabilities on the 12 networks, the GLI algorithm was the best-performing algorithm in 51.67% of the cases, the largest share among all algorithms. The shares of the other algorithms were as follows: 16.67% for the GLS algorithm, 10% for the KBKNR algorithm, 12.5% for the EC algorithm, 8.33% for the GIN algorithm, and 0.83% for the PL algorithm. In addition, the GLI algorithm performed well on all networks.

(3) Top-15 important nodes in different networks
Without loss of generality, in this experiment, the SIR model infection probability α was set to 0.02, and the recovery probability was set to one. First, the results of different algorithms on the network datasets were obtained and compared with the SIR model. Then, the 15 most influential nodes were extracted from the results. Finally, the algorithms' performances were analyzed by ranking the nodes. The first 15 nodes of the Karate, Jazz, and Ca-Astroph networks, which represent small, medium, and large networks, respectively, were selected and illustrated.

Table 7. Top-ranked nodes of the Karate network (rows after rank 4 are incomplete).
Rank  DC  EC  PL  PR  BC  CC  K-shell  GIN  KBKNR  RLGI  GLS  GLI  SIR
1     34  34  1   34  1   1   1        1    1      1     34   34   34
2     1   1   34  1   34  3   2        34   34     34    1    1    1
3     33  3   33  33  33  34  3        3    33     33    33   33   33
4     3   33  3   3   3   32  4        33   3      3     3    3    2
5     2   2   2   2   …

As presented in Table 7, the GLI, EC, GLS and PR algorithms shared 14 of their first 15 nodes with the SIR model. The GLI, GLS and PR algorithms ranked the top three nodes in the same order as the SIR model, achieving the best results among all algorithms. The EC algorithm ranked the first two nodes of the SIR model correctly and was the second-best performing algorithm, following the GLI, GLS and PR algorithms.
Table 8 shows the ranking of the top-15 nodes of the Jazz network, for which the PL algorithm achieved the best results. Fourteen of its 15 nodes were the same as those of the SIR model, and the ranking of the first four nodes was completely consistent with the SIR model. The GLI algorithm shared 12 of the 15 nodes with the SIR model, and its first four nodes were identical to the first four nodes of the SIR model. Although the GLI algorithm performed slightly worse than the PL algorithm, it achieved better results than the other algorithms, which indicated that the GLI algorithm was effective.

Table 8. Top-ranked nodes of the Jazz network (rows after rank 8 are incomplete).
Rank  DC   EC   PL   PR   BC   CC   K-shell  GIN  KBKNR  RLGI  GLS  GLI  SIR
1     135  59   59   135  135  135  34       135  59     135   59   59   59
2     59   131  135  59   152  59   59       59   167    59    131  135  135
3     131  135  131  167  59   167  97       131  131    167   135  131  131
4     167  167  167  131  148  69   98       167  98     148   167  167  167
5     69   107  107  148  167  82   99       69   107    95    107  107  98
6     98   98   69   69   166  131  100      107  121    131   98   98   69
7     107  130  98   166  188  193  107      98   130    166   130  130  107
8     82   69   130  82   114  121  130      82   134    69    69   121  82
9     157  82   193  95   95   173  131      193  99     152   100  69   …

As shown in Table 9, for the first 15 nodes of the Ca-Astroph network, the GLI, DC, and GIN algorithms each shared 13 nodes with the SIR model, and the importance ordering of the first 15 nodes was basically the same. However, only the GLI and GIN algorithms had the same top two nodes as the SIR model; they therefore achieved the best results and were followed by the DC algorithm. For the PR and CC algorithms, 12 nodes and 11 nodes out of 15 nodes were identical to those in the SIR model, respectively. For the Ca-Astroph network, the worst-performing algorithms were the K-shell and KBKNR algorithms, which had only one node identical to the SIR model nodes.
Consequently, different algorithms had different advantages for different networks. However, the proposed GLI algorithm generally performed the best among all algorithms on the above-presented three networks, having the most obvious advantages.

The nodes were arranged in descending order according to their importance values obtained by each algorithm. Then, the sequence of infection values was obtained from the node ranking results. It should be noted that if the ranking results of an algorithm were consistent with the results of the SIR model, a curve with a smooth downward trend from left to right would be formed. The results obtained when a single node is used as the seed node, ordered by the ranking of each algorithm, are presented in Fig. 4, where the abscissa represents the number of infected nodes in the network obtained by each of the algorithms, and the ordinate represents the number of nodes infected and recovered at time t. In Fig. 4, the data of the Polbooks, Jazz, Football, and Karate networks, which were small networks, are displayed on a linear scale; for the remaining eight networks, which had a large number of nodes, the data are displayed on a logarithmic scale, focusing on the most influential nodes.
As shown in Fig. 4, for the Blogs network, the result of the GLI algorithm showed an overall smooth decreasing trend, with the fewest peaks among all the algorithms. For the Ca-Astroph, Friendships, Brightkite, Reactome and USAir2010 networks, the results of the GLI algorithm had a few peaks, indicating that individual nodes were misranked, but the proposed GLI algorithm still achieved the best results among all the algorithms. For the EmailEU32430 network, the GLI, GLS, KBKNR, and K-shell algorithms performed well, but the curves of the KBKNR and K-shell algorithms declined less clearly, whereas the results of the proposed GLI algorithm fluctuated less and were the best among all the algorithms. Further, for the Polbooks network, the results of the proposed GLI algorithm fluctuated less and were the best among all the algorithms. For the Jazz network, the right part of the curve formed by the GLI algorithm had the least fluctuation and the best effect among all the algorithms. For the Protein network, the KBKNR algorithm could not run, so its curve is not shown in Fig. 4; among the remaining algorithms, the GLI algorithm achieved the best results. Therefore, the GLI method performed the best on ten networks: the Blogs, Ca-Astroph, Friendships, EmailEU32430, Polbooks, Jazz, Protein, USAir2010, Reactome and Brightkite networks. The data curves in the stacked map showed a smooth downward trend, which was consistent with the SIR model results.
For the Football network, due to the small difference in the degree value between the nodes, the curves of all algorithms showed certain fluctuations. The fluctuations of the KBKNR and EC algorithms were small, and their effect was relatively good. In the Karate network, except for the obvious curve fluctuations of the K-shell, CC, BC, and PR algorithms, the other algorithms showed a smooth downward trend, with a slight difference.
Consequently, the proposed GLI algorithm performed the best among all the algorithms on most networks, having similar results as the SIR model. Thus, the proposed algorithm could accurately identify key nodes in the networks.
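The two kinds of comparison used in this subsection, the overlap of top-ranked nodes and the smoothness of the infection curve, can be expressed with three small helpers. This is a hedged sketch: the function names are hypothetical, a "peak" is assumed to mean a point where the infection value rises instead of falling, and the sample lists are the first four Karate ranks of the EC, GLI, and SIR columns as they appear in the Table 7 data of this section.

```python
def top_k_overlap(rank_a, rank_b, k=15):
    """Number of nodes that two rankings share among their first k entries."""
    return len(set(rank_a[:k]) & set(rank_b[:k]))


def leading_match(rank_a, rank_b):
    """Length of the common prefix, i.e. how many top ranks agree exactly."""
    count = 0
    for a, b in zip(rank_a, rank_b):
        if a != b:
            break
        count += 1
    return count


def count_peaks(infection_values):
    """Count rises in a sequence that should decrease from left to right."""
    pairs = zip(infection_values, infection_values[1:])
    return sum(1 for previous, current in pairs if current > previous)


# First four Karate ranks for the SIR, GLI, and EC columns.
sir = [34, 1, 33, 2]
gli = [34, 1, 33, 3]
ec = [34, 1, 3, 33]
print(leading_match(gli, sir), leading_match(ec, sir))
print(top_k_overlap(gli, sir, k=4))
print(count_peaks([5.0, 3.0, 4.0, 2.0]))
```

On these ranks the GLI list agrees with the SIR list on the first three positions and the EC list on the first two, which matches the description of the Karate results given earlier in this subsection.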
Experimental results comparison with the SI model. To analyze the performance of the proposed algorithm further, the SI model was used to evaluate the key nodes identified by different algorithms. Due to limited space, only the Kendall values obtained by the algorithms are presented in this section.
The value of the infection probability α plays an important role in the experiment. The infection rate is set to (1/2)^θ (here we set θ = 3) 55,57 .
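A minimal SI simulation under this setting might look as follows. The sketch assumes that the infection rate is read as (1/2) raised to the power θ, which gives 0.125 for θ = 3; the function name `si_spread`, the fixed number of time steps, and the test graph are hypothetical choices, since the SI process has no recovery and needs an explicit stopping time.

```python
import random


def si_spread(adj, seed, theta=3, steps=10, rng=None):
    """Run the SI model for a fixed number of steps; return the infected count."""
    rng = rng or random.Random()
    alpha = 0.5 ** theta  # assumed reading of the rate: (1/2)^theta
    infected = {seed}
    for _ in range(steps):
        new_infected = set()
        for u in infected:
            for w in adj[u]:
                if w not in infected and rng.random() < alpha:
                    new_infected.add(w)
        # Infection is irreversible: infected nodes never leave the set.
        infected |= new_infected
    return len(infected)


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
sizes = [si_spread(chain, 0, theta=3, steps=10, rng=random.Random(i)) for i in range(100)]
print(sum(sizes) / len(sizes))
```

Averaging the infected count over repeated runs for each seed node yields the SI-based reference ranking used for the Kendall comparison.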
The Kendall values obtained by different algorithms for different networks are presented in Fig. 5. As shown in Fig. 5, the Kendall τ value of the GLI algorithm was the highest among all the algorithms for the Blogs, Friendships, USAir2010, Protein, Brightkite and EmailEU32430 networks. In the Jazz, Football and Karate networks, the Kendall τ values of the GIN algorithm were superior to those of the other algorithms. In the Polbooks and Reactome networks, the Kendall τ values of the KBKNR algorithm were the highest. Only in the Ca-Astroph network were the Kendall τ values of the CC algorithm the highest. Consequently, the proposed GLI algorithm performed the best among all the algorithms on most networks.
Infection capability of the top 15 nodes. To validate the effectiveness of the GLI algorithm, the infection ability of the top 15 nodes of the GLI and other algorithms was calculated in the SIR model. In the experiment, the infection probability α was set to 0.01, the recovery probability β was set to one, the time step ranged from 1 to 30, and the number of iterations was set to 1000.
As shown in Fig. 6, the number of infected nodes F(t) increased with the time step t and finally reached a stable value at time step t = 10. This indicated that the top 15 influential nodes effectively infected other nodes in a short time. In eight networks, namely the Blogs, Friendships, EmailEU32430, Polbooks, Karate, Protein, USAir2010 and Brightkite networks, the top 15 nodes of the GLI algorithm had the strongest infection ability. In the Ca-Astroph network, the infection abilities of the top 15 nodes of the GLI algorithm and the DC algorithm were similar and better than those of the other algorithms. In the Jazz and Reactome networks, the top 15 nodes of the RGLI algorithm and the PR algorithm had similar infection abilities, with distinct advantages over the GLI algorithm; however, the GLI algorithm performed better than the remaining algorithms. In the Football network, the performance of the GLI algorithm was average. Therefore, the top 15 nodes selected by the GLI algorithm had a stronger infection ability than those selected by the other algorithms in the majority of networks. This result further demonstrated the effectiveness and accuracy of the proposed algorithm.
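The multi-seed experiment can be sketched by recording the outbreak size at every time step. In this illustration F(t) is assumed to be the number of nodes that are infected or removed at step t; the names `sir_curve` and `mean_curve`, the synchronous update order, and the small test graph are hypothetical, while the defaults for α, β, the 30 time steps, and the 1000 runs follow the values quoted in this section.

```python
import random


def sir_curve(adj, seeds, alpha=0.01, beta=1.0, t_max=30, rng=None):
    """Return [F(1), ..., F(t_max)] for one SIR run started from several seeds."""
    rng = rng or random.Random()
    infected, removed = set(seeds), set()
    curve = []
    for _ in range(t_max):
        new_infected = {
            w
            for u in infected
            for w in adj[u]
            if w not in infected and w not in removed and rng.random() < alpha
        }
        recovered = {u for u in infected if rng.random() < beta}
        removed |= recovered
        infected = (infected - recovered) | new_infected
        curve.append(len(infected) + len(removed))
    return curve


def mean_curve(adj, seeds, runs=1000, t_max=30, rng_seed=0, **params):
    """Average F(t) over independent runs."""
    rng = random.Random(rng_seed)
    totals = [0] * t_max
    for _ in range(runs):
        run = sir_curve(adj, seeds, t_max=t_max, rng=rng, **params)
        for t, value in enumerate(run):
            totals[t] += value
    return [total / runs for total in totals]


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with a single seed.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(mean_curve(chain, [0], runs=200, t_max=5, alpha=0.5))
```

Passing the top 15 nodes of each algorithm as `seeds` and plotting the resulting curves side by side gives a comparison of the kind described for Fig. 6.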
Time complexity analysis. The time complexity analysis was performed considering the procedure performed by the proposed GLI algorithm, including three stages. The temporal complexity analysis results are described below.
First, the network was stored in the logical form of an n × n matrix; computing the degree of each node required going through the other (n − 1) nodes, so the time complexity of this stage was O(n^2). Further, the time complexity of calculating the K-shell value was O(n log n). Next, the given value of the adjacent nodes was calculated; this process involved all adjacent nodes of each node, and its time complexity was O(n(n − 1)) in the worst case, when the network was a complete network. The total time complexity was the maximum of the above three time complexity values, so the time complexity of the proposed GLI algorithm was O(n^2).

Conclusion
This paper proposes an efficient and accurate algorithm, named the GLI algorithm, for key node identification in a complex network. The GLI algorithm first calculates the Ks value of a network node, which is expressed as a global influence. Then, the local influence is obtained by considering the node degree, the degree values of the adjacent nodes, and the Ks values of the adjacent nodes, and by introducing a similarity coefficient between adjacent nodes. Finally, the node influence is calculated based on the global and local influence results. The proposed algorithm is verified by experiments, in which it is compared with other related algorithms using the results of the SIR and SI models as the evaluation index. Based on the experimental results of the nine algorithms on ten networks, the proposed GLI method performs better than the other algorithms.